ROCm and HIP: A Detailed 10-Chapter Tutorial on the Memory-Centric Nature of GPU Performance

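GPU acceleration requires abandoning the "compute-first" mindset. Modern performance is determined by memory management: the coordinated control of data allocation, synchronization, and transfer optimization between the host (CPU) and the device (GPU).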

1. The Memory-Compute Gap

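GPU arithmetic throughput ($TFLOPS$) has grown dramatically, while memory bandwidth ($GB/s$) has grown far more slowly. The resulting gap often leaves the execution units "starved," waiting for data to arrive from VRAM. As a result, GPU programming is in practice memory programming.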
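To make the gap concrete, here is a minimal sketch of a "machine balance" calculation: the number of FLOPs a device can execute in the time it takes to fetch one byte from VRAM. The function name `machine_balance` and the peak figures are illustrative assumptions, not measurements of any particular GPU; substitute the values from your own device's datasheet.

```python
# Machine balance: FLOPs the GPU can execute per byte delivered from VRAM.
# All hardware numbers below are hypothetical placeholders.

def machine_balance(peak_tflops: float, bandwidth_gbs: float) -> float:
    """Return the FLOPs per byte needed to keep the execution units busy."""
    peak_flops = peak_tflops * 1e12      # TFLOPS -> FLOP/s
    bandwidth = bandwidth_gbs * 1e9      # GB/s   -> byte/s
    return peak_flops / bandwidth

# Assumed example device: 40 TFLOPS peak, 1600 GB/s memory bandwidth.
balance = machine_balance(peak_tflops=40.0, bandwidth_gbs=1600.0)
print(f"machine balance: {balance:.1f} FLOPs/byte")
```

Under these assumed figures, a kernel that performs fewer than about 25 FLOPs for every byte it fetches leaves the execution units waiting on memory.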

2. The Roofline Model

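This model visualizes the relationship between arithmetic intensity (FLOPs/byte) and performance. Applications typically fall into one of two categories:

Memory-bound: limited by bandwidth (the steep, sloped part of the roof).
Compute-bound: limited by peak TFLOPS (the flat ceiling).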
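The two categories can be sketched in a few lines, using the standard roofline bound: attainable performance is the smaller of the compute peak and arithmetic intensity times bandwidth. The function names and the device figures (40,000 GFLOPS peak, 1,600 GB/s) are assumptions chosen for illustration only.

```python
# Roofline sketch: attainable performance = min(peak, intensity * bandwidth).
# Device figures are hypothetical; replace them with your hardware's values.

def attainable_gflops(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Upper bound on GFLOP/s for a kernel with the given FLOPs/byte."""
    return min(peak_gflops, intensity * bandwidth_gbs)

def classify(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> str:
    """Compare the kernel's intensity with the ridge point of the roofline."""
    ridge = peak_gflops / bandwidth_gbs  # FLOPs/byte where slope meets ceiling
    return "memory-bound" if intensity < ridge else "compute-bound"

PEAK_GFLOPS = 40_000.0   # assumed compute peak
BANDWIDTH_GBS = 1_600.0  # assumed memory bandwidth

# Single-precision vector add c[i] = a[i] + b[i]:
# 1 FLOP per element, 12 bytes moved (two 4-byte loads, one 4-byte store).
vector_add_intensity = 1 / 12
vector_add_kind = classify(vector_add_intensity, PEAK_GFLOPS, BANDWIDTH_GBS)
vector_add_bound = attainable_gflops(vector_add_intensity, PEAK_GFLOPS, BANDWIDTH_GBS)

# A hypothetical kernel that reuses each byte heavily (50 FLOPs/byte).
dense_kind = classify(50.0, PEAK_GFLOPS, BANDWIDTH_GBS)

print(vector_add_kind, f"{vector_add_bound:.1f} GFLOP/s")
print(dense_kind)
```

On this assumed device the vector add sits far down the sloped part of the roof, so raising the compute peak would not speed it up; only moving fewer bytes per FLOP would.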

3. The Cost of Data Movement

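The main performance bottleneck is not the math itself but the latency and energy cost of moving each byte across the PCIe bus or to and from HBM. High-performance code pays close attention to where data resides and minimizes transfers between host and device.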
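A rough back-of-the-envelope estimate shows why data residence matters. The sketch below compares re-copying a buffer over the host-device link before every kernel launch with copying it once and keeping it resident on the device. The bandwidth figures and helper names are assumptions for illustration, and the model ignores per-transfer latency, which would make the repeated-copy case look even worse.

```python
# Bandwidth-only estimate of transfer cost. Link speeds are hypothetical.

PCIE_GBS = 32.0    # assumed host <-> device link bandwidth
HBM_GBS = 1600.0   # assumed on-device memory bandwidth

def transfer_seconds(num_bytes: int, bandwidth_gbs: float) -> float:
    """Time to move num_bytes at the given bandwidth, ignoring latency."""
    return num_bytes / (bandwidth_gbs * 1e9)

def host_copy_seconds(num_bytes: int, launches: int, keep_resident: bool) -> float:
    """Total host-to-device copy time across a series of kernel launches."""
    copies = 1 if keep_resident else launches
    return copies * transfer_seconds(num_bytes, PCIE_GBS)

buffer_bytes = 1024**3  # 1 GiB working set
launches = 100          # kernels that all read the same buffer

ratio = transfer_seconds(buffer_bytes, PCIE_GBS) / transfer_seconds(buffer_bytes, HBM_GBS)
naive = host_copy_seconds(buffer_bytes, launches, keep_resident=False)
resident = host_copy_seconds(buffer_bytes, launches, keep_resident=True)

print(f"PCIe is {ratio:.0f}x slower than HBM for the same bytes")
print(f"re-copy every launch: {naive:.2f} s, copy once: {resident:.3f} s")
```

With these assumed numbers, the redundant copies cost a hundred times more link time than a single upload, which is the kind of avoidable transfer the paragraph above warns against.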


QUESTION 1

What is the primary cause of a GPU kernel being 'memory-bound'?

The clock speed of the GPU cores is too slow.

The rate of data delivery is slower than the rate of arithmetic execution.

There are too many threads running in parallel.

The CPU is faster than the GPU.

QUESTION 2

In the context of GPU programming, what does 'Memory Management' involve?

Only allocating variables on the CPU stack.

Controlling allocation, synchronization, and optimization of data transfer between host and device.

Optimizing the cache size of the L1 controller.

Manually cleaning the GPU registers after every kernel call.

QUESTION 3

Which axis of the Roofline Model represents 'Arithmetic Intensity'?

Vertical Axis (Y)

Horizontal Axis (X)

The slope of the line.

The area under the curve.

QUESTION 4

Why is redundant host-device transfer considered a 'performance tax'?

It consumes GPU registers.

Latency and energy consumption of moving data across PCIe is significantly higher than instruction execution.

It increases the floating-point precision error.

It causes the GPU to overheat instantly.

QUESTION 5

If a researcher's kernel spends 95% of its time 'stalled,' what is the most likely culprit?

The math instructions are too complex.

Inefficient orchestration of data residence causing the GPU to wait for data.

The GPU has too much VRAM.

The kernel was written in C++ instead of Python.